We are now entering the phase of doing serious statistics: we run regression models, inspect them, plot, and tabulate them.

For this exercise, let’s pretend to be researchers interested in the determinants of the subjective risk of being treated for COVID-19 in a hospital. We suspect that age plays an important role, something we have already assessed by visualizing a similar relationship. So… we need the GESIS Panel COVID-19 survey data again:

library(dplyr)
library(haven)
library(sjlabelled)

gp_covid <-
  read_sav(
    "../data/ZA5667_v1-1-0.sav"
  ) %>% 
  set_na(na = c(-1:-99, 97)) %>% 
  mutate(
    likelihood_hospital = hzcy003a,
    age_cat = as.factor(age_cat)
  ) %>% 
  remove_all_labels()

We start our analysis by checking the relationship with an ANOVA. We also include some very essential covariates: gender, political orientation, and marital status.

1

Run an ANOVA on the proposed relationship between the subjective likelihood of treatment in a hospital and age. Also, include all the covariates.
We had a similar analysis before… you can simply add the control variables with +.
serious_anova <- 
  aov(
    likelihood_hospital ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid
  )

summary(serious_anova)
##                         Df Sum Sq Mean Sq F value Pr(>F)    
## age_cat                  9    346   38.40  23.868 <2e-16 ***
## sex                      1      1    0.99   0.617  0.432    
## political_orientation    1      0    0.11   0.071  0.789    
## marstat                  1      1    1.06   0.657  0.418    
## Residuals             3092   4974    1.61                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 660 observations deleted due to missingness

Alright, there’s definitely something in the data we should investigate. Oftentimes this is considered bad practice, but we could try to create a median split for the dependent variable and run a logistic regression.

2

Create a median split for the variable likelihood_hospital and run a logistic regression with all variables. What do you see in the results?
You can create new variables fairly easily with the function dplyr::mutate() in combination with the ifelse() function.
gp_covid <-
  gp_covid %>% 
  mutate(
    likelihood_hospital_cut =
      ifelse(
        likelihood_hospital > median(likelihood_hospital, na.rm = TRUE),
        1,
        0
      )
  )

serious_logistic_regression <- 
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "logit")
  )

summary(serious_logistic_regression)
## 
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat, family = binomial(link = "logit"), data = gp_covid)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.259  -1.008  -0.756   1.242   2.070  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.1525329  0.3987424  -5.398 6.73e-08 ***
## age_cat2               0.5570465  0.4001281   1.392  0.16387    
## age_cat3               0.2237200  0.4070216   0.550  0.58256    
## age_cat4               0.8552502  0.3851614   2.220  0.02638 *  
## age_cat5               1.0188140  0.3837725   2.655  0.00794 ** 
## age_cat6               1.2015271  0.3783196   3.176  0.00149 ** 
## age_cat7               1.5248309  0.3643355   4.185 2.85e-05 ***
## age_cat8               1.7367353  0.3739040   4.645 3.40e-06 ***
## age_cat9               1.9148385  0.3750669   5.105 3.30e-07 ***
## age_cat10              1.9574845  0.3732600   5.244 1.57e-07 ***
## sex                    0.0687836  0.0777806   0.884  0.37652    
## political_orientation  0.0247512  0.0206458   1.199  0.23059    
## marstat               -0.0007526  0.0481534  -0.016  0.98753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4067.1  on 3104  degrees of freedom
## Residual deviance: 3888.4  on 3092  degrees of freedom
##   (660 observations deleted due to missingness)
## AIC: 3914.4
## 
## Number of Fisher Scoring iterations: 4
# Interpretation:
# With increasing age, the probability of thinking it's likely to be 
# hospitalized because of COVID-19 increases. But really?

Running one model is not enough. It may be that the assumptions for logistic regression are not fulfilled, or a reviewer simply doesn’t like these types of regressions. Instead, she proposes to run a binomial regression but with a cauchit link.

3

Run the same model with the sole difference of using a cauchit link. How would you interpret the different regression coefficients?
Have a look at the help page ?family to see how you can include a cauchit link.
serious_cauchit_regression <- 
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "cauchit")
  )

summary(serious_cauchit_regression)
## 
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat, family = binomial(link = "cauchit"), data = gp_covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2523  -1.0064  -0.7601   1.2431   2.0485  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -2.548363   0.849671  -2.999  0.00271 **
## age_cat2               1.036283   0.875953   1.183  0.23680   
## age_cat3               0.502607   0.912892   0.551  0.58193   
## age_cat4               1.427898   0.852017   1.676  0.09376 . 
## age_cat5               1.605221   0.848333   1.892  0.05846 . 
## age_cat6               1.792792   0.843912   2.124  0.03364 * 
## age_cat7               2.072772   0.837850   2.474  0.01336 * 
## age_cat8               2.246052   0.840343   2.673  0.00752 **
## age_cat9               2.386031   0.840525   2.839  0.00453 **
## age_cat10              2.419927   0.840032   2.881  0.00397 **
## sex                    0.046430   0.069014   0.673  0.50109   
## political_orientation  0.017770   0.018113   0.981  0.32655   
## marstat               -0.004667   0.040372  -0.116  0.90796   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4067.1  on 3104  degrees of freedom
## Residual deviance: 3889.1  on 3092  degrees of freedom
##   (660 observations deleted due to missingness)
## AIC: 3915.1
## 
## Number of Fisher Scoring iterations: 5
# Interpretation of difference:
# I don't know...

Using different link functions is sometimes done as they provide different model fits. That’s definitely something we should investigate as well.

4

Compare both regression models with an ANOVA. Use the option test = "LRT" to perform a likelihood ratio test. What’s your interpretation?
A p-value considered as statistically significant would indicate a difference between the models.
anova(
  serious_logistic_regression, 
  serious_cauchit_regression,
  test = "LRT"
  )
## Analysis of Deviance Table
## 
## Model 1: likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat
## Model 2: likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      3092     3888.4                     
## 2      3092     3889.1  0 -0.72287
# Interpretation of difference:
# It seems there is no big difference. We can make the reviewer happy with the
# cauchit models and still keep our results.